1. Introduction

Background

NMBS/SNCB trains, which is the biggest railway serving the country of Belgium, needs to postpone the expansion because of lack of train personnel staff. Also, the raise on energy cost increase the operating cost. Meanwhile, the company still face revenue loss because of low passengers on some routes.

With these problems, the company can strategically plan their policies on optimizing resource allocation, enhancing operational cost efficiency, and also increasing revenue especially on their low occupancy routes. But how can they do that?

Use Case

This project is aim to predict occupancy levels (categorized by high and low) for different Origin-Destination (OD) pairs of Belgium railway system to allocate resources such as train frequency and size effectively to help transportation planners of NMBS/SNCB or the Transportation Department in Belgium.

e.g. extend/merge train routes, minimize vacant trains or mitigate crowdedness of trains etc.

Question:

What is the level of occupancy of a certain OD pair at a specific time?

What options do planners have to manage this demand?

e.g. decrease amount of trains, merge of two similar routes, or decrease the frequent of the route to change low routes to medium, and of increase amount of trains, separate two similar routes, or increase the frequent of the route to change high routes to medium.

We build this web-based application, called Re-Train that predict the occupancy rate across the routes with three stages, first analysing and visualizing past data, creating long-term or short-term predictions based on the users’ need, and last, showcasing potential benefits and risk on the predictions.

This is the interface which shows the latest occupancy rate where orange shows low occupancy routes. First feature is History trend analysis. On the top-right corner, user can filter based on those variables and click analysis, to see graph and maps of the train occupancy level in the past.

Second feature is prediction. you can predict by filtering like the first one, and choose the period time based on your needs. After clicking the predict, the predicted occupancy rate will result in the percentage of each level and be shown in the map, and also the accuracy percentage & graph below the map.

Last feature, you can check on the cost benefit analysis based on the scenario of the prediction which will show you the potential benefit & potential risk of optimizing routes on those period of time. Then, you can export the report to help you make decision based on that.

How it works?

We use Kaggle train occupancy level data on 2016 which have these variables that potentially correlate with the occupancy rate. And add some external datas like weather, demographic, and fare, and then predict using logistic regression.

Taking a look at the data, we can see a clear distribution between high and low occupancy where: Low occupancy routes are most prominent during off-peak hours, on weekends, and scattered throughout the date but peak on early July-August. Meanwhile high occupancy routes are strong during commuter hour, on weekdays, and spikes on mid-late October.

From this, we can see high-demand OD pairs are consistent, while low-demand OD pairs keep on changing. High occupancy is concentrated on major intercity and commuter routes. Meanwhile, smaller or regional routes shows reduced demand. But there are two quite big OD Pairs that dominate low-occupancy, possibly indicating a mismatch in train frequency with actual demand.

The modeling approach uses binomial regression to predict train occupancy (high or low) while identifying key factors that are validated through k-fold cross-validation for robustness. Also, a confusion matrix will assess the model’s predictive accuracy and reliability which is shown in the app through accuracy percentage and graph.

To minimize the risks of misclassifying high-occupancy routes as low, we set a higher threshold to ensure only routes with a strong likelihood of low occupancy are chosen just like shown on the animation example, It will prioritize service reliability and passenger satisfaction.

By correctly predicting low-occupancy routes, we can achieve significant cost savings in maintenance, labor, and energy while avoiding wasted resources. However, misclassifying high-occupancy as low can lead to revenue loss, overcrowding, and passenger dissatisfaction. Striking the right balance in predictions ensures optimized resource allocation and supports a sustainable and efficient rail network.

2. Data Setup and Wrangling

Load Libraries

We loaded libraries that are required to analyze and visualize the data as shown in the code below.

knitr::opts_chunk$set(echo = TRUE)
#install.packages("hms")
#install.packages("BelgiumStatistics", repos = "http://www.datatailor.be/rcube", type = "source")
#install.packages("devtools")
#library(devtools)
#devtools::install_github("jwijffels/BelgiumMaps.Admin", build_vignettes = TRUE)
#devtools::install_github("jwijffels/StatisticsBelgium", build_vignettes = TRUE)
#library(BelgiumStatistics)
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(caret)
library(pscl)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(plotROC)
library(pROC)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)
library(leaflet)
library(lubridate)
library(hms)
library(riem) # for weather
library(readxl) # for read xlsx
library(ggtext)
library(showtext) # to set up font in plots
font_add_google("Roboto", "roboto") # to set up font in plots
showtext_auto(TRUE)
library(geosphere) # to calculate distance from two different geolocation (lat, lng)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
colorPallete <- c("low" = "#df543b", "medium" = "#a1b7b8", "high" = "#a1b7b8")
plotTheme <- theme(
  plot.title = element_text(face="bold", hjust = 0, size=15, lineheight=0.8),
  plot.subtitle = element_text(hjust = 0, size = 10, face = "italic", lineheight=0.8, margin = margin(b = 3, t = 6)),  
        plot.caption = element_text(size = 10, hjust = 0, lineheight=0.9),
        plot.margin = margin(1.7, 1.7, 1.7, 1.7),
        text = element_text(family = "roboto"),
        legend.title = element_text(size = 10))

Load Data

We loaded three default data (1)-(3) of NMBS trains, stations, and lines from Kaggle, and four additional data (4)-(7) of Belgium statistical tracts, Belgium population data, zone station data, and weather data from RIEM package.

  1. Trains dataset

The training dataset of trains provides date, time, origin station code (from), destination station code (to), vehicle ID, and occupancy information shown at three different levels (low, medium, high), from July 27, 2016 to October 29, 2016. The test dataset of trains have the same columns as the training dataset, but from October 29, 2016 to December 19, 2016. The test dataset does not have occupancy level information.

  1. Stations dataset

The stations dataset contains names, locations (longitude and latitude), and average stop times (avg_stop_times). Since NMBS operates trains in Belgium as well as France and the Netherlands, we filtered the dataset to exclude non-Belgian stations (e.g. those in the Netherlands, France, Luxembourg etc.). Additionally, we created columns to calculate the distance between origin and destination stations, the distance from Belgium to the origin and destination stations, and the bearing between origin and destination stations.

  1. Lines dataset

The lines dataset includes information about the vehicle ID, vehicle type, the number of stops, and stopping station IDs. Since the information in the train dataset is insufficient, we did not use stopping station IDs in this analysis.

  1. Belgium statistical tracts dataset

The Belgium statistical tracts dataset contains the geometries of each tract. We gained this dataset from Statbel, the Belgian statistical office. We joined this dataset to calculate the population of Belgium by municipality and intersect this dataset to the stations and the weather stations.

  1. Belgium population dataset

We assumed that the population of the origin and destination regions would affect the occupancy level of the trains. We used the Belgium population dataset to calculate the population of Belgium by municipality and joined this dataset to the statistical tracts dataset.

  1. Zone station dataset

The zone station dataset contains the zone information of each train station. At first, we assumed the fare information may be one of the independent variable so we added this information to the stations dataset to calculate the fare. However, we could not find the opened fare information of NMBS, so we did not use this information in the analysis.

  1. Weather dataset

We assumed that the weather conditions of the origin and destination regions at specific dates and times would affect the occupancy level of the trains. The dataset is from RIEM, which is developed by Iowa Environment Mesonet. Since RIEM provides global coverage, we were able to get temperature, wind, and relative humidity data from 12 weather stations located in Belgium airports. Since the percipitation data of Belgium was unavailable in RIEM, we used relative humidity as a substitute variable to reflect general patterns related to precipitation.

# Load (1) trains, (2) stations, and (3) SNCB lines data
line_info <- read.csv("./Data/line_info.csv") 
trains_test <- read.csv("./Data/trains_test.csv") 
trains_train <- read.csv("./Data/trains_train.csv") 
stations <- read.csv("./Data/stations.csv")

# Load (4) Belgium statistical tracts and (5) population data
statBel <- st_read("./Data/sh_statbel_statistical_sectors_3812_20240101.geojson") %>%
  st_transform(4326) 
popBel <- read_excel("./Data/TF_POP_STRUCT_SECTORS_2016.xlsx") # 2016 Belgium population data by tracts

# Load Zone station data to calculate fare
stationZone <- read_excel("./Data/station-zone.xlsx") 

# Filter stations data to exclude non-Belgian stations (Netherlands, France, etc.)
stations <- stations %>%
  filter(country.code=="be")
stations <- left_join(stations, stationZone, by = c("name" = "Station")) # Add zone information to stations)

# Stat tracts and population of Belgium by municipality
statBel_muni <- statBel %>%
  group_by(cd_dstr_refnis) %>%
  summarize(geometry = st_union(geometry)) %>%
  ungroup()

popBel_muni <- popBel %>%
  mutate(CD_REFNIS_N = as.character(as.numeric(substr(CD_REFNIS, 1, 2)) * 1000)) %>%
  select(CD_REFNIS_N, POPULATION, TX_DESCR_FR) %>%
  group_by(CD_REFNIS_N) %>%
  summarize(POPULATION = sum(POPULATION)) 

statBel_muni <- left_join(statBel_muni, popBel_muni, by = c("cd_dstr_refnis" = "CD_REFNIS_N"))

# Clean up the URI column to extract the station IDs
stations <- stations %>%
  mutate(station_id = gsub("http://irail.be/stations/NMBS/", "", URI))

# Convert stations dataset to sf object
stations_sf <- stations %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

Data Wrangling

We transformed raw data into a more usable format for analysis through the following processs. First, we combined the training and test datasets to manipulate the entire dataset at once. After we finished the data wrangling, we separated them again into training and test sets. Second, we recoded occupancy data from three-level (Low-medium-high) to two-levels (Low-high).

# Change station code from numeric to character and rbind train and test sets
trains_test$from <- as.character(trains_test$from)
trains_test$from <- paste0("00", trains_test$from) 
trains_test$to <- as.character(trains_test$to)
trains_test$to <- paste0("00", trains_test$to) 
trains_test$occupancy <- NA

# Rbind train and test sets
trains <- rbind(trains_train, trains_test)
#trains$occupancy <- factor(trains$occupancy, levels = c("low", "medium", "high"))
# Believe this causes regression error

# Convert date and time to POSIX datetime
trains$datetime <- as.POSIXct(paste(trains$date, trains$time), format = "%Y-%m-%d %I:%M:%S %p")
trains$week <- week(trains$datetime)
trains$dotw <- wday(trains$datetime, label = TRUE)

# Update vehicle ID
trains <- trains %>%
  mutate(vehicle = case_when(
    grepl("^\\d+$", vehicle) & vehicle %in% c("7006", "7966", "8011", "8015") ~ paste0("P", vehicle),
    grepl("^\\d+$", vehicle) ~ paste0("IC", vehicle),
    TRUE ~ vehicle
  ))

# Recode Low-medium-high to Low-high
trains$occupancy_original <- trains$occupancy
trains <- trains %>%
  mutate(occupancy = recode(occupancy, "medium" = "high")) %>%  # (1) Recode medium to high
#  mutate(occupancy = ifelse(occupancy == "medium", NA_character_, occupancy)) %>%
  # (2) Remove medium to high
  mutate(interval60=ymd_h(substr(datetime, 1, 13))) %>%
  mutate(week=week(interval60)) 

3. Exploratory Data Analysis

Distribution of Train Occupancy Levels

# Occupancy distribution
trains %>%
  subset(!is.na(occupancy)) %>%
  count(occupancy) %>%
  ggplot(aes(x = occupancy, y = n, fill = occupancy)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colorPallete) +
  labs(
    title = "Distribution of Train Occupancy Levels",
    subtitle = "Training data, July - Oct 2016",
    x = "Occupancy Level",
    y = "Count of Trains"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "none")

Distribution of Train Occupancy Levels by Time Trend

Train Occupancy Trend by Hour of the Day

# Time trend
trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(hour = hour(datetime), occupancy) %>%
  count() %>%
  ggplot(aes(x = factor(hour), y = n, fill = occupancy)) +  
  geom_bar(stat = "identity") +  
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Hour of the Day",
    subtitle = "Training data, July - Oct 2016",
    x = "Hour of Day",
    y = "Count of Trains",
    fill = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(hour = hour(datetime)) %>%
  count(occupancy) %>%
  ggplot(aes(x = hour, y = n, color = occupancy)) +
  geom_line() +
  scale_color_manual(values = colorPallete) +
  labs(
    title = "Train Occupancy Trend by Hour of the Day",
        subtitle = "Training data, July - Oct 2016",
    x = "Hour of Day",
    y = "Count"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

Train Occupancy Trend by Day of Week

trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(dotw, hour = hour(datetime), occupancy) %>%
  count() %>%
  ggplot(aes(x = dotw, y = n, fill=occupancy)) +  # Use color for lines
  geom_bar(stat = "identity") +  # Stacked bar chart
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Day of Week",
    subtitle = "Training data, July - Oct 2016",
    x = "Day",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(dotw, occupancy) %>%
  summarise(n = n(), .groups = "drop") %>%  # Use summarise instead of count
  ggplot(aes(x = dotw, y = n, color = occupancy, group = occupancy)) +
  geom_line(size = 1) +
  scale_color_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Day of Week",
            subtitle = "Training data, July - Oct 2016",
    x = "Day",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

Train Occupancy Trend by Date

trains %>%
  subset(!is.na(occupancy)) %>%
  mutate(date = as.Date(datetime)) %>%  
  group_by(date, hour = hour(datetime), occupancy) %>%
  count() %>%
  ggplot(aes(x = date, y = n, fill=occupancy)) +  # Use color for lines
  geom_bar(stat = "identity") +  # Stacked bar chart
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Train Occupancy Trend by Date",
    subtitle = "Training data, July - Oct 2016",
    x = "Date and Hour",
    y = "Count of Trains",
    color = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

trains %>%
  subset(!is.na(occupancy)) %>%
  group_by(date = date(datetime)) %>%
  count(occupancy) %>%
  ggplot(aes(x = date, y = n, color = occupancy)) +
  geom_line() +
  scale_color_manual(values = colorPallete) +
  labs(
    title = "Train Occupancy Trend by Date",
        subtitle = "Training data, July - Oct 2016",
    x = "Date",
    y = "Count"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

Location of Stations

# Create a leaflet map
leaflet(stations_sf) %>%
  addTiles() %>%
  addCircleMarkers(
    label = ~name,
    radius = 5,
    color = "blue",
    fillOpacity = 0.7
  ) %>%
  addLegend("bottomright", colors = "blue", labels = "Stations", title = "Station Map")

Location of Stations by Zone

ggplot()+
  geom_sf(data=statBel_muni %>% st_transform(4326), aes(fill=POPULATION), color="#333")+
  geom_sf(data=stations_sf , aes(color=Zone), size=1)+
   geom_sf_text(data = stations_sf ,
               aes(label = name),  
               size = 3, color = "#fff")+
  theme_void()+plotTheme+
  theme(legend.position = "bottom")

Weather data

stations_with_pop <- st_join(stations_sf, statBel_muni) %>%
  st_drop_geometry()
stations_with_pop <- stations_with_pop %>%
  bind_cols(as.data.frame(st_coordinates(stations_sf))) %>%
  rename(longitude = X, latitude = Y)

# Add from and to station names and left join population 
trains <- trains %>%
  left_join(stations_with_pop, by = c("from" = "station_id")) 
trains <- trains %>%
  select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, name, avg_stop_times, longitude, latitude, cd_dstr_refnis, POPULATION) %>%
  rename(from_station = name) %>%
  rename(from_avg_stop_times = avg_stop_times) %>%
  rename(from_lng = longitude) %>%
  rename(from_lat = latitude) %>%
  rename(from_pop = POPULATION) %>%
  rename(from_cd_dstr_refnis = cd_dstr_refnis)

trains <- trains %>%
  left_join(stations_with_pop, by = c("to" = "station_id")) 
trains <- trains %>%
  select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, from_station, from_avg_stop_times, from_lng, from_lat, from_pop, from_cd_dstr_refnis,  name, avg_stop_times,  longitude, latitude, cd_dstr_refnis,POPULATION) %>%
  rename(to_station = name) %>%
  rename(to_avg_stop_times = avg_stop_times) %>%
  rename(to_lng = longitude) %>%
  rename(to_lat = latitude) %>%
  rename(to_pop = POPULATION) %>%
  rename(to_cd_dstr_refnis = cd_dstr_refnis)

# Load weather data
weatherstations_sf <- riem_stations(network = "BE__ASOS") %>% # BE__ASOS = Belgium ASOS (Weather Stations)
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
weatherstations <- weatherstations_sf %>% 
  pull(id)
weather_belgium <- data.frame()

for (station in weatherstations) {        # Loop through each station
  weather_data <- riem_measures(          # Fetch data for the current station
    station = station,
    date_start = "2016-07-27",
    date_end = "2016-12-19"
  ) %>%
    select(valid, tmpf, relh, sknt) %>%   # Select relevant columns (timestamp, temperature by Fahrenheit, relative humidity in percentage, wind speed in knots)
    mutate(station_id = station)          # Add a column to identify the station
  weather_belgium <- bind_rows(weather_belgium, weather_data) # Append the fetched data to the main data frame
}

weather_belgium <- weather_belgium %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>% # Extract the timestamp and convert it to a POSIXct object
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  group_by(interval60, station_id) %>%
  summarize(Temp = (max(tmpf, na.rm = TRUE)-32)*5/9, # Temperature by Celsius
            Humid = max(relh, na.rm = TRUE), # relative humidity in percentage
            Wind = max(sknt, na.rm = TRUE)*0.51444) # Wind speed in meters per second

# Convert weather dataset to sf object
weatherstations_with_muni <- st_join(weatherstations_sf, statBel_muni) %>%
  st_drop_geometry()
# Find nearest weather station for municipalities without a station
statBel_muni <- statBel_muni %>%
  mutate(
    weatherstation_within = st_intersects(geometry, weatherstations_sf, sparse = FALSE) %>%
      apply(1, function(x) if (any(x)) weatherstations_sf$id[which(x)[1]] else NA)  # Replace `id` with the column identifying stations
  )
statBel_muni <- statBel_muni %>%
  mutate(
    nearest_weatherstation = ifelse(
      is.na(weatherstation_within),
      weatherstations_sf$id[st_nearest_feature(geometry, weatherstations_sf)],  # Replace `id` with station identifier
      weatherstation_within
    )
  )
statBel_muni <- statBel_muni %>% select(-weatherstation_within)

# left join weather station data
trains <- trains %>%
  left_join(statBel_muni, by=c("from_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
  rename(from_nearest_weatherstation = nearest_weatherstation) %>%
  select(-geometry, -POPULATION)

trains <- trains %>%
  left_join(statBel_muni, by=c("to_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
  rename(to_nearest_weatherstation = nearest_weatherstation) %>%
  select(-geometry, -POPULATION)

# left join weather data
trains <- trains %>%
  left_join(weather_belgium, by = c("interval60" = "interval60", "from_nearest_weatherstation" = "station_id")) %>%
  mutate(from_temp = Temp) %>%
  mutate(from_humid = Humid) %>%
  mutate(from_wind = Wind) %>%
  select(-Temp, -Humid, -Wind)  

trains <- trains %>%
  left_join(weather_belgium, by = c("interval60" = "interval60", "to_nearest_weatherstation" = "station_id")) %>%
  mutate(to_temp = Temp) %>%
  mutate(to_humid = Humid) %>%
  mutate(to_wind = Wind) %>%
  select(-Temp, -Humid, -Wind) 

trains <- trains %>%
  mutate(from_humid = case_when(is.infinite(from_humid) ~ NA_real_,
            TRUE ~ from_humid)) %>%
  mutate(to_humid = case_when(is.infinite(to_humid) ~ NA_real_,
            TRUE ~ to_humid))

# Left join line data
trains <- trains %>%
   left_join(line_info, by = c("vehicle" = "vehicle_id")) 

trains <- trains %>%
  mutate(
    vehicle_type = ifelse(
      is.na(vehicle_type) | vehicle_type == "",  
      gsub("[0-9]", "", vehicle),           
      vehicle_type                          
    )
  )

trains <- trains %>%
  filter(!(vehicle_type %in% c("EUR", "ICE", "ICT", "TGV", "TRN", "undefined", "(null)")))


# Create event data
trains <- trains %>%
  mutate(event = case_when(dotw == "Sat" ~ 1,
                           dotw == "Sun" ~ 1,
                           ymd(interval60) >= "2016-07-28" & ymd(interval60) <= "2016-09-04" ~ 1,
                           ymd(interval60) == "2016-11-03" | ymd(interval60) == "2016-11-04" | ymd(interval60) == "2016-11-05" ~ 1,
                           ymd(interval60) == "2016-11-13" | ymd(interval60) == "2016-11-14" | ymd(interval60) == "2016-11-15" ~ 1,
                           TRUE ~ 0))
ggplot()+
  geom_sf(data=statBel_muni , fill="#a1b7b8", color="#333")+
  geom_sf(data=weatherstations_sf, color="#df543b", size=1)+
  geom_sf_text(data = statBel_muni ,
               aes(label = cd_dstr_refnis),  
               size = 3, color = "black") +
  labs(
    title = "Weather Station Location in Belgium",
        subtitle = "Weather data from RIEM package",
    caption = "Source: StatBel, RIEM package by Maelle Salmon"
  ) +
  theme_void()+plotTheme

grid.arrange(top = "Weather DATA - Belgium - Jul - Dec 2016",
              ggplot(weather_belgium, aes(interval60, Humid)) + geom_line(aes(color=station_id))+plotTheme,
              ggplot(weather_belgium, aes(interval60, Wind)) + geom_line(aes(color=station_id))+plotTheme,
              ggplot(weather_belgium, aes(interval60, Temp)) + geom_line(aes(color=station_id))+plotTheme)

Top 25 Count of Trains OD Pairs

# Filter high occupancy
high_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
  filter(occupancy == "high") %>%
  count(from_station, to_station) %>%
  arrange(desc(n)) %>%
  rename(n_high = n)

low_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
  filter(occupancy == "low") %>%
  count(from_station, to_station) %>%
  arrange(desc(n))%>%
  rename(n_low = n)

occupancy <- high_occupancy %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 

occupancy_data <- trains %>%
  subset(!is.na(occupancy)) %>%
  count(from_station, to_station, occupancy) %>%
  group_by(from_station, to_station) %>%
  mutate(total = sum(n)) 

occupancy_data %>%
  filter(!is.na(to_station)) %>%
   ungroup() %>% 
  arrange(desc(total)) %>%
  slice_head(n = 70) %>%  # Top 70 OD pairs
  ggplot(aes(
    x = reorder(paste(from_station, to_station, sep = " → "), total),
    y = n,
    fill = occupancy
  )) +
  geom_bar(stat = "identity") +  # Stacked bar chart
  coord_flip() +
  scale_fill_manual(values = colorPallete) +  
  labs(
    title = "Top 25 Count of Trains OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count of Trains",
    fill = "Occupancy Level"
  ) +
  theme_minimal()+plotTheme+
  theme(legend.position = "bottom")

Top 10 High Occupancy Count of Trains OD Pairs

# Plot top 10 OD pairs with high occupancy
high_occupancy %>%
  filter(!is.na(to_station)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_high), y = n_high)) +
  geom_bar(stat = "identity", fill = "#a1b7b8") +
  coord_flip() +
  labs(
    title = "Top 10 High-Occupancy OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count"
  ) +
  theme_minimal()+plotTheme

Top 10 Low Occupancy Count of Trains OD Pairs

# Plot top 10 OD pairs with low occupancy
low_occupancy %>%
  filter(!is.na(to_station)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_low), y = n_low)) +
  geom_bar(stat = "identity", fill = "#df543b") +
  coord_flip() +
  labs(
    title = "Top 10 Low-Occupancy OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Count"
  ) +
  theme_minimal()+plotTheme

# Create and left join the frequency (count) of high and low occupancy data of origin and destination station by each OD pair
trains <- trains %>%
  left_join(high_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 
trains <- trains %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) 

# Calculate the distance from O to Brussels, from D to Brussels, and from O to D in kilometers
trains <- trains %>%
  mutate(
    dist_from_O_to_Brussels = pmap_dbl(
      list(from_lng, from_lat), 
      ~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
    ),
    dist_from_D_to_Brussels = pmap_dbl(
      list(to_lng, to_lat),
      ~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
    ),
    dist_from_O_to_D = pmap_dbl(
      list(from_lng, from_lat, to_lng, to_lat),
      ~ distHaversine(c(..1, ..2), c(..3, ..4)) / 1000
    )
  )

# Calculate the bearing angle from O to D
deg2rad <- function(deg) {
  return(deg * pi / 180)
}
rad2deg <- function(rad) {
  return(rad * 180 / pi)
}
calculate_bearing <- function(lat1, lon1, lat2, lon2) {
  phi1 <- deg2rad(lat1)
  phi2 <- deg2rad(lat2)
  delta_lambda <- deg2rad(lon2 - lon1)
  
  theta <- atan2(
    sin(delta_lambda) * cos(phi2),
    cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(delta_lambda)
  )
  
  bearing <- (rad2deg(theta) + 360) %% 360
  return(bearing)
}

trains$bearing_from_O_to_D <- calculate_bearing(trains$from_lat, trains$from_lng, trains$to_lat, trains$to_lng)

trains$bearing_from_O_to_D_cat <- case_when(
  trains$bearing_from_O_to_D >= 337.5 | trains$bearing_from_O_to_D < 22.5 ~ "N",
  trains$bearing_from_O_to_D >= 22.5 & trains$bearing_from_O_to_D < 67.5 ~ "NE",
  trains$bearing_from_O_to_D >= 67.5 & trains$bearing_from_O_to_D < 112.5 ~ "E",
  trains$bearing_from_O_to_D >= 112.5 & trains$bearing_from_O_to_D < 157.5 ~ "SE",
  trains$bearing_from_O_to_D >= 157.5 & trains$bearing_from_O_to_D < 202.5 ~ "S",
  trains$bearing_from_O_to_D >= 202.5 & trains$bearing_from_O_to_D < 247.5 ~ "SW",
  trains$bearing_from_O_to_D >= 247.5 & trains$bearing_from_O_to_D < 292.5 ~ "W",
  trains$bearing_from_O_to_D >= 292.5 & trains$bearing_from_O_to_D < 337.5 ~ "NW"
)

# Categorize time
time_category <- function(time) {
  hour <- as.numeric(format(time, "%H"))
  if ((hour >= 7 & hour < 9) | (hour >= 17 & hour < 19)) {
    return("commute time")
  } else if (hour >= 5 & hour < 7) {
    return("morning")
  } else if (hour >= 9 & hour < 17) {
    return("noon")
  } else if (hour >= 19 & hour < 24) {
    return("evening")
  } else {
    return("midnight")
  }
}



trains <- trains %>%
  mutate(time_cat = sapply(datetime, time_category))
trains$time_cat <- factor(trains$time_cat, levels = c( "morning", "commute time", "noon", "evening", "midnight"))
# Plot top 25 OD pairs with low occupancy percentage
high_occupancy %>%
  left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) %>%
  filter(!is.na(to_station)) %>%
  mutate(perc = n_low/(n_high+n_low)*100) %>%
  arrange(desc(perc))%>%
  head(25) %>%
  ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), perc), y = perc)) +
  geom_bar(stat = "identity", fill = "#df543b") +
  coord_flip() +
  labs(
    title = "Top 25 Low-Occupancy Percentage OD Pairs",
    subtitle = "Training data, July - Oct 2016",
    x = "OD Pair",
    y = "Percentage of Low Occupancy (%)"
  ) +
  theme_minimal()+plotTheme

# Distribution summary
#summary(high_occupancy$n)
#summary(low_occupancy$n)
#  group_by(from_station, to_station) %>%
#  summarise(n = sum(n)) %>%
#  ungroup()

# Visualize distribution 

ggplot(low_occupancy, aes(x = n_low)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "white") +
  labs(
    title = "Distribution of Low Occupancy Counts",
    x = "Count (n)",
    y = "Frequency"
  ) +
  theme_minimal()+plotTheme

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, from_avg_stop_times, to_avg_stop_times, nr_of_stops) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Average stop times of origin and destination station and the number of stops of each line\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_summary()`).

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, from_wind, to_wind, from_temp, to_temp, from_humid, to_humid) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Temperature and wind of origin and destination station\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 568 rows containing non-finite outside the scale range
## (`stat_summary()`).

train_summary <- trains %>%
  filter(!is.na(occupancy)) %>% 
  group_by(dotw, time_cat) %>%
  summarize(
    total_count = n(),  
    low_count = sum(occupancy == "low", na.rm = TRUE),  
    low_ratio = low_count / total_count*100  
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'dotw'. You can override using the
## `.groups` argument.
ggplot(train_summary, aes(x = dotw, y = low_ratio, fill=time_cat)) +
  geom_bar(stat = "identity", position = "dodge") +  
  scale_fill_manual(
    values = c(
      "morning" = "#1f78b4",      
      "commute time" = "#33a02c", 
      "noon" = "#e31a1c",         
      "evening" = "#ff7f00",      
      "midnight" = "#6a3d9a"      
    )
  ) +
  labs(
    title="Feature associations with the likelihood of occupancy Level",
    subtitle="Low occupancy rate by time category of each day\nTraining data, July - Oct 2016", 
    x = "Day of Week",
    y = "Low occupancy rate (%)", 
    caption="Source: Kaggle"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")+plotTheme

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels) %>%
  gather(Variable, value, -occupancy) %>%
  ggplot(aes(occupancy, value, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary", fun="mean") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Mean distance (km)", 
       title="Feature associations with the likelihood of occupancy Level",
       subtitle="Distance from O, D to Brussels, and from O to D\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## Warning: Removed 85 rows containing non-finite outside the scale range
## (`stat_summary()`).

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, event) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  ylim(0,1200)+
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Event: 0 = Weekday,  1 = Weekend or holiday\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, time_cat) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,650)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="0700-0900 and 1700-1900 commute, 1900-2400 evening, 0000-0500 midnight, 0500-0700 morning, 0900-1700 noon\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, vehicle_type) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,1000)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

trains %>%
  filter(!is.na(occupancy)) %>% 
  select(occupancy, bearing_from_O_to_D_cat) %>%
  gather(Variable, value, -occupancy) %>%
  count(Variable, value, occupancy) %>%
  ggplot(aes(occupancy, n, fill=occupancy)) +
  geom_bar(position="dodge", stat="summary") +
  ylim(0,250)+
  facet_wrap(~value, scales="free") +
  scale_fill_manual(values=colorPallete) +
  labs(x="Occupancy", y="Count", 
       title="Feature associations with the likelihood of low occupancy",
       subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016", 
       caption="Source: Kaggle")+
  theme_minimal()+plotTheme+
  theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

threshold <- 3 # Moderate threshold capturing higher-demand OD pairs

# Filter high-occupancy OD pairs
high_routes <- high_occupancy %>%
  filter(n_high >= threshold) %>%
  left_join(stations, by = c("from_station" = "name")) %>%
  rename(from_lon = longitude, from_lat = latitude) %>%
  left_join(stations, by = c("to_station" = "name")) %>%
  rename(to_lon = longitude, to_lat = latitude)

low_routes <- low_occupancy %>%
  filter(n_low >= threshold) %>%
  left_join(stations, by = c("from_station" = "name")) %>%
  rename(from_lon = longitude, from_lat = latitude) %>%
  left_join(stations, by = c("to_station" = "name")) %>%
  rename(to_lon = longitude, to_lat = latitude)

# Create line data for leaflet polylines
high_routes <- high_routes %>%
  mutate(route_label = paste(from_station, "->", to_station))

low_routes <- low_routes %>%
  mutate(route_label = paste(from_station, "->", to_station))

# Step 4: Create line data for leaflet polylines
routes_lines <- high_routes %>%
  rowwise() %>%
  do({
    data.frame(
      lng = c(.$from_lon, .$to_lon),
      lat = c(.$from_lat, .$to_lat),
      route_label = .$route_label
    )
  })

routes_lines_l <- low_routes %>%
  rowwise() %>%
  do({
    data.frame(
      lng = c(.$from_lon, .$to_lon),
      lat = c(.$from_lat, .$to_lat),
      route_label = .$route_label
    )
  })

# Create a color palette for the zones
zone_colors <- colorFactor(
  palette = "Dark2",
  domain = stations_sf$Zone # Map unique zone values
)
# Create Leaflet map
leaflet() %>%
  addTiles() %>%
  # Add station markers
  addCircleMarkers(
    data = stations_sf,
    label = ~name,
    radius = 3,
    color = ~zone_colors(Zone),
    fillOpacity = 0.1
  ) %>%
  # Add OD pair lines

  addPolylines(
    data = routes_lines,
    lng = ~lng,
    lat = ~lat,
    color = "#a1b7b8",
    opacity = 0.3,
    weight = 2,
    label = ~route_label
  ) %>%
    addPolylines(
    data = routes_lines_l,
    lng = ~lng,
    lat = ~lat,
    color = "#df543b",
    opacity = 0.1,
    weight = 2,
    label = ~route_label
  ) %>%
  
  
  # Add legend
  addLegend(
    "bottomright",
    colors = c("black", "#a1b7b8",  "#df543b"),
    labels = c("Stations", "High Occupancy Routes", "Low Occupancy Routes"),
    title = "Routes Map"
  )

4. Test of Logistic Regression Model (Original Model)

trains <- trains %>%
  mutate(date_numeric = as.numeric(as.Date(date))) %>%
  mutate(time_numeric = as.numeric(as_hms(time))) %>%
  mutate(occupancy_numeric = case_when(
    occupancy == "low" ~ 1,
    occupancy == "high" ~ 0,
    TRUE ~ NA_real_
  ))

# Split trains into trains_train and trains_test  
trains_train <- trains %>% filter(!is.na(occupancy)) 
trains_test <- trains %>% filter(is.na(occupancy))

# Split trains_train into trains_trainTrain and trains_trainTest
set.seed(3456)
trainIndex <- createDataPartition(trains_train$occupancy, p = .65, 
                                  list = FALSE,
                                  times = 1)
trains_trainTrain <- trains_train[ trainIndex,]
trains_trainTest  <- trains_train[-trainIndex,]

reg <- glm(occupancy_numeric ~ ., data = 
                    trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric),
                family = "binomial"(link = "logit"))

a<-summary(reg) # / AIC: 1,987
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values for train logistic model
llh llhNull G2 McFadden r2ML r2CU
-984.2853 -1008.389 48.20693 0.0239029 0.0318569 0.042939
testProbs <- data.frame(Outcome = as.numeric(trains_trainTest$occupancy_numeric),
                        Probs = predict(reg, trains_trainTest, type= "response"))
#testProbs$Probs_p <- case_when(
#  testProbs$Probs >= 0.5 ~ 1,
#  testProbs$Probs < 0.5 ~ 0,
#  TRUE ~ NA
#)
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  xlim(0,1)+
  scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
  labs(x = "Prediction of Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme

ggplot() +
  geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve ",
       subtitle = "trains_trainTest that is split of trains_train") +
  theme(legend.position = "bottom")+plotTheme

pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6277
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6277

Cost Benefit Analysis

5. Test of Logistic Regression Model (Improved Model)

reg <- glm(occupancy_numeric ~ ., data = 
                    trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric, 
                                                        
     from_avg_stop_times, from_pop,  from_avg_stop_times, to_avg_stop_times, to_pop, from_temp, from_wind, from_humid, to_temp, to_wind, to_humid, event, nr_of_stops, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels, time_cat, vehicle_type, bearing_from_O_to_D_cat),
                family = "binomial"(link = "logit"))

b<-summary(reg) # / AIC: 1,651
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
Pseudo-R² Values for train logistic model
llh llhNull G2 McFadden r2ML r2CU
-786.4662 -883.1156 193.2989 0.1094414 0.1377718 0.1856972
testProbs <- data.frame(Outcome = (trains_trainTest$occupancy_numeric),
                        Probs = predict(reg, trains_trainTest, type= "response"))


ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  xlim(0,1)+
  scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
  labs(x = "Prediction of Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme

testProbs <- testProbs %>%
  mutate(predOutcome = (ifelse(Probs >= 0.45,1,0))) # Threshold = 0.45
testProbs %>%
  filter( !is.na(predOutcome)) %>% 
  summarize(observedLow=sum(Outcome), predictedLow=sum(predOutcome)) %>%

  gather(Variable, Value) %>%
  ggplot(aes(x = Variable, y = Value))+
  geom_bar(position="dodge", stat="identity")

ggplot() +
  geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve ",
       subtitle = "trains_trainTest that is split of trains_train") +
  theme(legend.position = "bottom")+plotTheme

pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6764
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6764

Cost Benefit Analysis

6. Prediction

7. Conclusion and Discussion

Conclusion

Discussion